Hydropathy Conformational Letter and its Substitution Matrix 
HP-CLESUM: an Application to Protein Structural Alignment

Motivation: The protein sequence world is discrete, built from 20 amino acids (AA), while the structure world is continuous, though it can be discretized into structural alphabets (SA). To reveal the relationship between sequence and structure, it is natural to consider AA and SA in a joint space. However, such a space has too many parameters, so the AA alphabet must be reduced to bring the number of parameters down.

Result: We have developed a simple but effective approach, called entropic clustering, which selects the reduction of AAs that retains the most mutual information between the reduced AAs and the SAs. The optimized reduction of the AAs into two groups yields the hydrophobic and hydrophilic classes. Combined with our SA, the conformational letter (CL) of 17 letters, this gives a joint alphabet called the hydropathy conformational letter (hp-CL). A joint substitution matrix with (17 × 2)² entries is derived from FSSP. We then check three coding systems, AA, CL and hp-CL, against a large database of protein pairs ranging from the family to the fold level, measuring the TopK accuracy of both similar fragment pairs (SFPs) and the neighbors of aligned fragment pairs (AFPs). The TopK selection follows the score calculated with each coding system's substitution matrix. Finally, embedding hp-CL in a pairwise alignment algorithm, CLeFAPS, in place of the original CL improves the results on the HOMSTRAD benchmark.

Contact*: wangsheng@itp.ac.cn 



I. INTRODUCTION 

Proteins fold into specific spatial conformations to perform their biological functions [1], and there is abundant evidence that their amino acid (AA) sequences determine their structures. Finding the relationship between structure and sequence is a fundamental task in computational biology [3].

Compared to the sequence world, which is discrete with 20 AAs, the structure world is continuous, though the local conformational space of a protein backbone fragment is rather limited. The idea of representing the backbone with a string of discrete letters goes back to Corey and Pauling and was later refined into the concept of protein secondary structure elements (SSEs). However, segments of a single SSE may vary significantly in their 3D structures, especially for the coil state, which is not a true secondary structure but a class of conformations that indicates the absence of regular SSEs such as the alpha helix or beta strand [23]. Although SSEs can be predicted with high accuracy (>80%), the description of a protein in terms of its SSEs is not sufficient to capture its 3D geometry accurately [20].

To overcome this limitation, several groups have proposed representing protein structures as a series of overlapping fragments, each labeled with a symbol, which defines a structural alphabet (SA) for proteins. Such an alphabet can be used to predict local structure [15-17], to reconstruct the full-atom representation, to identify structural motifs [19], to classify protein structures [20] and to search against a database [21, 22]. We have proposed our own SA, the conformational letter (CL) [14], which consists of 17 letters, each describing a fragment four residues long. Our SA is aimed at fast pairwise [24], multiple [25] and flexible [26] structure alignment, in combination with its substitution matrix CLESUM.

Having discretized the continuous structure world into SAs, we can consider AA and SA in a joint space. However, with the currently popular SAs such a space is too large: a joint substitution matrix needs about (20 × 17)² parameters. It is therefore necessary to reduce the AA alphabet [27]. Several groups have put forward reduced AA alphabets obtained either experimentally or computationally. For example, Baker et al. found a five-letter alphabet for 38 out of 40 selected sites of the SH3 chain; Wang & Wang introduced the minimal mismatch principle to reduce the alphabet based on Miyazawa-Jernigan's residue-residue statistical potential [30]; Murphy et al. [31] approached the same problem using the BLOSUM matrix [32]. Recently, de Brevern et al. proposed using their SA, Protein Blocks [10], to analyze equivalences between the different kinds of amino acids, and obtained reduced AAs from them.

Here we present a novel reduction method called entropic clustering [34]. Briefly, given two discrete distributions $A$ and $B$, merging $a_i$ and $a_j$ into one group $a_{i\&j}$ results in a loss of mutual information between $A$ and $B$. Thus, the mutual information $I$ can be naturally chosen as the objective function for optimized clustering. When grouping the 20 AAs into two categories, we obtain the hydrophobic and hydrophilic classes, which agrees exactly with the HP model [35]. We then construct a joint substitution matrix, HP-CLESUM, with (17 × 2)² entries in the same way as CLESUM.

The following tests are used to check and compare the coding systems AA, CL and hp-CL with their corresponding substitution matrices, BLOSUM, CLESUM and HP-CLESUM. We first compare the TopK accuracy of SFPs (similar fragment pairs) and of the neighbors of AFPs (aligned fragment pairs) on a large dataset covering the homology levels from family to fold according to SCOP [46]; we then embed hp-CL into CLeFAPS [26], replacing the original CL, and obtain an improvement on the popular HOMSTRAD benchmark.



[FIG. 1 panels. (B): table of the 17 conformational states and their mixture-model parameters, with the assignment rule $i = \arg\max_i P(C_i \mid x_k)$, where $x_k = (\theta, \tau, \theta')$. (C): example CL string for the chain >1molA.]



FIG. 1: The conversion from a 3D protein structure to a 1D CL string. (A) A sliding window of length four gives four contiguous Cα atoms, which determine two bending angles θ, θ′ and a torsion angle τ. (B) Select the state that maximizes the probability of the given data, where each state is a Gaussian distribution in the three-dimensional space of θ, τ, θ′. (C) Assign the state (letter) to the third position of the four Cα atoms, so a protein of length N gets N − 3 letters; the first two positions and the last position are assigned the 'blank' letter R.
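
As a rough illustration of steps (A) and (B) in Fig. 1, the sketch below computes (θ, τ, θ′) for every window of four Cα atoms and picks the most probable state of a Gaussian mixture. The function names, the sign convention of the torsion angle and the mixture parameters passed to `assign_state` are assumptions made for this example; the actual 17 states and their parameters are those of Ref. [14], and the periodicity of τ is ignored here.

```python
import numpy as np


def bend_angle(a, b, c):
    """Angle at b (radians) formed by the points a-b-c."""
    u, v = a - b, c - b
    cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cosang, -1.0, 1.0)))


def torsion_angle(p0, p1, p2, p3):
    """Signed dihedral (radians) between the planes (p0,p1,p2) and (p1,p2,p3)."""
    b0, b1, b2 = p1 - p0, p2 - p1, p3 - p2
    n1, n2 = np.cross(b0, b1), np.cross(b1, b2)
    m1 = np.cross(n1, b1 / np.linalg.norm(b1))
    return float(np.arctan2(np.dot(m1, n2), np.dot(n1, n2)))


def window_angles(ca):
    """Return (theta, tau, theta') for each window of four consecutive C-alpha atoms."""
    ca = np.asarray(ca, dtype=float)
    out = []
    for k in range(len(ca) - 3):
        p0, p1, p2, p3 = ca[k:k + 4]
        out.append((bend_angle(p0, p1, p2),
                    torsion_angle(p0, p1, p2, p3),
                    bend_angle(p1, p2, p3)))
    return out


def assign_state(x, weights, means, covs):
    """Index of the mixture component with the largest weighted Gaussian density at x.

    weights, means and covs are hypothetical inputs; they are not the
    fitted parameters of the 17 conformational states.
    """
    x = np.asarray(x, dtype=float)
    best, best_logp = None, -np.inf
    for i, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        d = x - np.asarray(mu, dtype=float)
        logp = np.log(w) - 0.5 * (d @ np.linalg.solve(cov, d)
                                  + np.log(np.linalg.det(cov)))
        if logp > best_logp:
            best, best_logp = i, logp
    return best
```

A chain of N Cα atoms yields N − 3 windows, which matches the N − 3 letters mentioned in the caption.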



II. MATERIALS AND METHOD 

A. Datasets 

We use the PDB-SELECT databank to construct our CLs, and the FSSP database [38] to derive the substitution matrix CLESUM. PDB-SELECT contains 1544 non-membrane proteins from the PDB with amino acid identity below 25%. FSSP is based on an exhaustive all-against-all structure comparison of representative protein structures, where the representative set contains no pair with more than 25% sequence identity. A tree for the fold classification of the 2,860 representatives is constructed by hierarchical clustering based on structural similarity. The family indices of FSSP are obtained by cutting the tree at levels of 2, 4, 8, 16, 32 and 64 standard deviations above the database average.



B. Conformational letter and its substitution
matrix

Four contiguous Cα atoms, say i − 2, i − 1, i and i + 1, determine two bending angles θ, θ′ and a torsion angle τ, which is the dihedral angle between the two planes of the triangles (i − 2, i − 1, i) and (i − 1, i, i + 1) (see Fig. 1). Using a mixture model for the density distribution of the three angles, the local structural states from PDB-SELECT have been clustered into 17 discrete states (see our previous work [14] for details). To use our SA directly for structural comparison, a score matrix similar to BLOSUM [32] for AAs is needed. We first convert the structures of the FSSP representative set to their CL strings; then collect all pairwise alignments of structures that share the same first three family indices (DALI Z-score > 8) (see Fig. 2); and finally count all ungapped aligned pairs of CLs to generate the substitution matrix CLESUM (Table I). In total there are 10,047 structure pairs, comprising 175,723 fragment pairs and 1,284,750 code pairs.

C. Entropic clustering



From FSSP, which also contains the AA information, it is possible to construct a substitution matrix in the joint space of structure and sequence. However, such a matrix would have about (17 × 20) × (17 × 20) parameters (Fig. 3(A)). If we group the 20 AAs into two clusters, the number of parameters is reduced to (17 × 2) × (17 × 2).

[FIG. 2 panels (A)-(E): fragments in FSSP, the i-th fragment, and the 17 × 17 CL occurrence matrix with entries N_AA, N_AB, N_AC, ...]

FIG. 2: Flowchart of the construction of CLESUM. (A) Collect all structure pairs with the same first three family indices (DALI Z-score > 8) in the representative set from FSSP. (B) For each pair, extract the alignment. (C) For each alignment, extract the ungapped aligned fragments. Each fragment contains both AAs and CLs. (D) Count the occurrences of each CL duad. For example, N_AB is the total number of times CL A and CL B occur together in the alignments. (E) Calculate CLESUM from the occurrence matrix.

[FIG. 3 panels: fragments in FSSP and the i-th fragment, shown as aligned [AA] and [CL] strings and as the [tag] string of the AA groups, e.g. ...001000001...]

FIG. 3: Flowchart of entropic clustering on the joint space of both AAs and CLs. (A) For each fragment in FSSP, count the pairwise number of joint occurrences. For example, N_{da,ac} is the total pairwise number of 'da' with 'ac'; in each duad the former letter is the CL and the latter is the AA. (B) Use Monte Carlo to randomly group the 20 AAs into two categories. (C) Given the AA categories, assign each AA to its group. (D) Calculate the reduced occurrence matrix; here N_{d0,a1} is the total pairwise number of 'd0' with 'a1', where in each duad the former is the CL and the latter is the AA's tag. (E) Calculate the mutual information of the reduced matrix. (F) If the categories that maximize I have been found, stop the Monte Carlo iteration. (G) Calculate HP-CLESUM.

TABLE I: CLESUM: The conformational letter substitution matrix (in 0.05 bit units).

Generally, the mutual information $I$ of two discrete distributions $A$ and $B$ is defined as

$$ I(A, B) = \sum_{i,j} p(a_i, b_j) \log \frac{p(a_i, b_j)}{p(a_i)\, p(b_j)}. $$

When $a_i$ and $a_j$ are merged into one group $a_{i\&j}$, with $p(a_{i\&j}) = p(a_i) + p(a_j)$ and $p(a_{i\&j}, b) = p(a_i, b) + p(a_j, b)$, the change in $I$ caused by the clustering is given by

$$ \Delta I = \sum_{b} \left[ p(a_{i\&j}, b) \log \frac{p(a_{i\&j}, b)}{p(a_{i\&j})\, p(b)} - p(a_i, b) \log \frac{p(a_i, b)}{p(a_i)\, p(b)} - p(a_j, b) \log \frac{p(a_j, b)}{p(a_j)\, p(b)} \right], \qquad (3) $$

which can be simplified by introducing

$$ x_i = \frac{p(a_i, b)}{p(a_i)}, \qquad \omega_i = \frac{p(a_i)}{p(a_i) + p(a_j)} $$

and their analogs $x_j$ and $\omega_j$, and then defining $\langle F(x) \rangle = \omega_i F(x_i) + \omega_j F(x_j)$ and $F(\langle x \rangle) = F(\omega_i x_i + \omega_j x_j)$, where $\omega_i + \omega_j = 1$. The terms linear in $x$ cancel, so Eq. (3) equals $p(a_{i\&j}) \sum_b [ f(\langle x \rangle) - \langle f(x) \rangle ]$, that is, it is proportional to $f(\langle x \rangle) - \langle f(x) \rangle$ with $f(x) = x \log x$. By Jensen's inequality, for the convex function $x \log x$ we have $f(\langle x \rangle) \le \langle f(x) \rangle$, so $I$ never increases after any step of clustering.

That is to say, merging any two members into one cluster results in a loss of mutual information. To make this loss as small as possible, the $I$ that remains after clustering should be maximized, so $I$ can be naturally chosen as the objective function during clustering. We call this approach entropic clustering [34]. If we partition n objects into $m_1$ and $m_2$ classes, where $m_2 > m_1$, it is easy to prove that the maximal $I$ at $m_2$ is never smaller than the maximal $I$ at $m_1$ [23].
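
The argument above can be checked numerically on a small table. The following sketch, which is only an illustration and not the code used in this work, computes $I$ for a joint count table, merges two rows, and searches all two-group partitions of the rows for the one that keeps $I$ largest. The toy counts and the function names are hypothetical, and the exhaustive search stands in for the Monte Carlo search used for the 20 AAs.

```python
from itertools import product

import numpy as np


def mutual_information(joint):
    """I(A;B) in bits for a joint count or probability table (rows: A, columns: B)."""
    p = np.asarray(joint, dtype=float)
    p = p / p.sum()
    pa = p.sum(axis=1, keepdims=True)   # marginal of A, shape (n, 1)
    pb = p.sum(axis=0, keepdims=True)   # marginal of B, shape (1, m)
    mask = p > 0                        # 0 * log 0 is taken as 0
    return float(np.sum(p[mask] * np.log2(p[mask] / (pa @ pb)[mask])))


def merge_rows(joint, i, j):
    """Merge rows i and j of the table into a single row (one clustering step)."""
    p = np.asarray(joint, dtype=float)
    keep = [k for k in range(p.shape[0]) if k not in (i, j)]
    return np.vstack([p[keep], p[i] + p[j]])


def best_bipartition(joint):
    """Exhaustively find the two-group partition of the rows that maximizes I."""
    p = np.asarray(joint, dtype=float)
    n = p.shape[0]
    best_info, best_labels = -1.0, None
    for rest in product((0, 1), repeat=n - 1):
        labels = (0,) + rest            # row 0 is fixed in group 0 to avoid duplicates
        if 1 not in labels:
            continue                    # both groups must be non-empty
        grouped = np.vstack([
            p[[k for k in range(n) if labels[k] == g]].sum(axis=0)
            for g in (0, 1)
        ])
        info = mutual_information(grouped)
        if info > best_info:
            best_info, best_labels = info, labels
    return best_labels, best_info


# Hypothetical toy counts: rows 0 and 1 behave alike, as do rows 2 and 3.
toy = [[8, 2], [7, 3], [2, 8], [1, 9]]
labels, info = best_bipartition(toy)
```

For the toy table the search groups rows 0 and 1 against rows 2 and 3, and the retained $I$ does not exceed that of the unmerged table, as the Jensen argument requires.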

Now turning back to our substitution matrix in the joint space, we may define the average mutual information as in BLOSUM,

$$ I = \sum_{X,Y} P(X, Y) \log \frac{P(X, Y)}{P(X)\, P(Y)}, \qquad (5) $$

where $X$ and $Y$ each denote a joint state of CL and AA (either reduced or not). Given a grouping of the AAs, we can calculate its $I$ from Eq. (5), and according to entropic clustering we should choose the categories that maximize $I$ (Fig. 3(F)).
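
A minimal sketch of how a BLOSUM-style log-odds matrix and its average mutual information could be computed from a table of aligned-pair counts is given below. The symmetrization step, the rounding to 0.05-bit units and the toy counts are assumptions for illustration; the exact counting conventions used for CLESUM and HP-CLESUM may differ.

```python
import numpy as np


def log_odds_matrix(counts, unit_bits=0.05):
    """Return (scores in units of `unit_bits`, average mutual information in bits).

    counts[x][y] is the number of times state x is aligned with state y.
    """
    n = np.asarray(counts, dtype=float)
    n = (n + n.T) / 2.0                 # assumed: treat aligned pairs as unordered
    q = n / n.sum()                     # pair frequencies P(X, Y)
    p = q.sum(axis=1)                   # background frequencies P(X)
    with np.errstate(divide="ignore"):
        s = np.log2(q / np.outer(p, p))  # log-odds in bits
    seen = q > 0
    avg_info = float(np.sum(q[seen] * s[seen]))
    return np.rint(s / unit_bits), avg_info


# Hypothetical two-state example.
scores, avg_info = log_odds_matrix([[8, 2], [2, 8]])
```

The second return value is the quantity that a candidate grouping of the AAs would be ranked by.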



III. RESULT 

A. Joint substitution matrix of conformational 
letters and reduced amino acids 

For clustering the 20 AAs into two categories, the Monte Carlo search finds AVCFIWLMY and DEGHKNPQRST as the two groups, which are exactly the hydrophobic and hydrophilic clusters [35]. The CLs enlarged with these tags are called hp-CLs.
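
With the two groups fixed, turning a structure into its hp-CL string is a simple per-position lookup. The sketch below assumes that the AA sequence and the CL string have already been brought to the same length; the function names and the tags 'h' and 'p' are choices made for this example.

```python
HYDROPHOBIC = set("AVCFIWLMY")
HYDROPHILIC = set("DEGHKNPQRST")


def hp_tag(aa):
    """Map a one-letter amino acid code to 'h' (hydrophobic) or 'p' (hydrophilic)."""
    if aa in HYDROPHOBIC:
        return "h"
    if aa in HYDROPHILIC:
        return "p"
    raise ValueError(f"unknown amino acid: {aa!r}")


def to_hp_cl(aa_seq, cl_seq):
    """Pair each conformational letter with the hydropathy tag of its residue."""
    if len(aa_seq) != len(cl_seq):
        raise ValueError("AA sequence and CL string must have the same length")
    return [(cl, hp_tag(aa)) for aa, cl in zip(aa_seq, cl_seq)]
```

Two aligned hp-CLs then select their score from CLESUM-hh, CLESUM-pp or CLESUM-hp according to their tags.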



TABLE II: CLESUM-hh (lower left) and CLESUM-pp (upper right) (in units of 0.05 bit). A 'black dot' and a 'white circle' mark the CLs with different hydropathy AA types.

TABLE III: CLESUM-hp (row-column) (in units of 0.05 bit).

         J     H     I     K     N     Q     L     G     M     B     P     A     O     C     E     F     D
...
L            -65   -47    -8    12    25    65    17    -3   -13    -8   -10    10    -4   -26   -24   -29
G      -20   -54   -35   -15   -17     6    16    52    12    -5          -8    21    -5   -31   -13   -37
M       11   -13    -5     4    12    -1    -3     9    47   -17     4    -4    13   -16   -54   -41   -54
B      -54   -95   -75   -53   -19    -1   -19     5   -21    34    35    18     4     6    -2    -3    -1
P      -34   -62   -50   -36    -6         -19     4     1    21    53    32    17    -1   -22   -16   -27
A      -26   -49   -37   -24     7     6   -19    -8    -7     2    26    66    26     2   -37   -30   -31
O      -35   -62   -43     4     9   -21    -7    -4    -3   -35   -12          67   -22   -69   -55   -57
C      -42   -74   -57   -36   -10    30    -4    -8   -12    -3     1     7     9    42   -14           2
E      -81  -116   -95   -80   -44    -6   -23   -24   -43    12    -1   -21   -19     8    25    18    14
F      -64   -95   -79   -62   -20     2   -15    -9   -30     4     5   -18     2    21    14    33    14
D      -84  -114  -100   -79   -30    20   -25   -32   -45    11    -3   -17   -29    23     9     9    34




The substitution matrix of hp-CLs is called HP-CLESUM. This symmetric matrix can be divided into three sub-matrices: CLESUM-hh, CLESUM-pp and CLESUM-hp. The first two, shown in Table II, correspond to aligned amino acids of the same hydropathy type (i.e., h-h and p-p). The third, shown in Table III, corresponds to different hydropathy types (h-p). As expected, compared with the original CLESUM (Table I), the elements of CLESUM-hh and CLESUM-pp generally become larger in absolute value, and those of CLESUM-hp show the opposite tendency. The tendency is stronger for letters dominated by helices or sheets.



B. Comparison between different coding systems 

1. Overview 



[FIG. 4 panels: 'TopK SFP's Accuracy Check', with the SFP-list sorted by rank (Top1 to Top4) and the outcomes Exact, Partial or Error; 'TopK AFP's Neighbor Accuracy Check', with the AFP's neighbor list sorted by score rank and the same outcomes.]

FIG. 4: TopK accuracy check procedure. (A) Select a pair of structures and align them with MATT, which yields a series of AFPs. (B) Encode the two structures; the coding system may be AA or an SA. (C) Search for all SFPs of length L under the chosen coding system. (D) Sort these SFPs in descending order of score, where the score is calculated with the corresponding substitution matrix. (E) Check whether a correct SFP exists within the TopK. (F) For a given AFP whose length is at least L, search the other string for all possible neighbors. (G) Sort these AFP neighbors. (H) Check their accuracy.

A coding system for protein structure is defined as an alphabet combined with its corresponding substitution matrix. Amino acids (AA) or any kind of SA can be treated as a coding system, as long as the alphabet has a substitution matrix. We compare the performance of three coding systems, AA, CL and hp-CL, based on their TopK accuracy against a benchmark (Fig. 4). SFPs and AFPs differ as follows: SFPs are all locally similar fragment pairs, while AFPs are the subset of SFPs that appear in the final alignment [26].
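
To make the notion of an SFP-list concrete, the sketch below scores every ungapped fragment pair of length L between two encoded strings and sorts the pairs by score. It is a brute-force illustration under assumed names; the toy scoring function (match +2, mismatch −1) is hypothetical and merely stands in for BLOSUM, CLESUM or HP-CLESUM.

```python
def toy_score(alphabet, match=2, mismatch=-1):
    """Build a hypothetical substitution table as a dict keyed by letter pairs."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def find_sfps(s1, s2, length, score, threshold=0):
    """Return (score, i, j) for all ungapped fragment pairs scoring above threshold.

    i and j are the start positions in s1 and s2; the list is sorted by
    descending score, with ties kept in scan order.
    """
    hits = []
    for i in range(len(s1) - length + 1):
        for j in range(len(s2) - length + 1):
            total = sum(score[(a, b)]
                        for a, b in zip(s1[i:i + length], s2[j:j + length]))
            if total > threshold:
                hits.append((total, i, j))
    hits.sort(key=lambda h: -h[0])
    return hits


# Hypothetical three-letter strings.
sfps = find_sfps("HHHEE", "CHHHE", 3, toy_score("CEH"))
```

The TopK SFPs are simply the first K entries of the returned list.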

Note that if the length of an AFP, Len, is at least L, we check every position, Len − L + 1 in total; if Len is shorter than L, we skip this AFP. The TopK SFP accuracy of a single pair of structures is therefore a 0 or 1 measure: within the TopK we either find a correct SFP or we do not. The TopK accuracy of the AFP neighbors, in contrast, is calculated by summing all correct positions found in any AFP and dividing by the total number of valid positions, so the result lies between 0.0 and 1.0.
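
The two measures can be written down in a few lines. In this sketch an SFP counts as correct when its start pair (i, j) belongs to the reference alignment, and an AFP position counts as correct when the reference partner is among the TopK ranked neighbors; both criteria, as well as the function names and the `neighbor_rank` callback, are assumptions made for illustration.

```python
def valid_positions(afp_len, length):
    """Number of fragment start positions of size `length` inside an AFP."""
    return max(afp_len - length + 1, 0)


def topk_sfp_hit(ranked_sfps, reference_pairs, k):
    """Return 1 if any of the top-k (score, i, j) SFPs starts on a reference pair, else 0."""
    return int(any((i, j) in reference_pairs for _, i, j in ranked_sfps[:k]))


def afp_neighbor_accuracy(afps, neighbor_rank, length, k):
    """Fraction of valid AFP positions whose reference partner is a top-k neighbor.

    afps: iterable of (start_i, start_j, afp_len) from the reference alignment.
    neighbor_rank(i): start positions in the second string, best score first.
    """
    correct = total = 0
    for start_i, start_j, afp_len in afps:
        for offset in range(valid_positions(afp_len, length)):
            total += 1
            if start_j + offset in neighbor_rank(start_i + offset)[:k]:
                correct += 1
    return correct / total if total else 0.0
```

An AFP of length 8 checked with L = 6 contributes 8 − 6 + 1 = 3 positions, while an AFP of length 4 contributes none.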

The benchmark is divided into three levels, family, superfamily and fold, according to SCOP [46]. For the family set, we use all SCOP families that have 2 to 25 members in the ASTRAL 40% compendium [13]; the total number of pairs is 21,039. For the superfamily and fold sets, it is convenient to use SABmark [43] instead of ASTRAL 40%, because SABmark is systematically arranged and carefully checked at both the superfamily and the fold level. The superfamily set contains 3,645 domains sorted into 426 subsets, and the fold set (also called the twilight zone [48]) contains 1,740 domains sorted into 209 subsets, where each subset contains between 3 and 25 structures. The superfamily set contains 18,724 structure pairs and the fold set contains 10,306. We apply MATT to conduct all-against-all pairwise alignment within each family or subset as our gold standard.



2. Performance 

Table IV shows that hp-CL performs best and CL second; both outperform AA by 10% to 100% in relative terms. In detail, the accuracy of all coding systems increases with TopK and with the length L, and declines from the family to the fold level. Surprisingly, at the fold level hp-CL outperforms AA by more than 50% in the TopK SFP accuracy test and by more than 100% in the TopK AFP neighbor test; in the latter, hp-CL reaches an average accuracy of about 71% for L = 18 using only the Top-1 neighbor of an AFP. This feature may be applied to construct the Highest Similarity Fragment Block (HSFB) in multiple structure alignment [23]. Given a seed structure and a certain position, if this position has many high-scoring neighbors in other structures, we may say that this block (consisting of the seed position and its neighbors) is more likely to appear in the final multiple alignment.

Moreover, we have shown the effectiveness of the parameter self-adaptive strategy for creating the SFP-list in [26]. In most cases the self-adaptive strategy is comparable in accuracy to fixed parameters, while the size of the SFP-list is kept empirically at about O(n²/LEN_H/6) with LEN_H = 9 (Fig. 5). With the fixed-parameter strategy, however, it is hard to balance the size of the SFP-list against the SFP score threshold. In fact, the fixed-length data in Table IV consider all O(n²) SFPs and then select the TopK; we have tested different SFP thresholds, and if the threshold is set too high the SFP-list is empty or very short, while if it is set too low the SFP-list becomes too long (data not shown).
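
For a sense of scale, the empirical size estimate can be evaluated directly. The helpers below are hypothetical illustrations: with LEN_H = 9 the expression n²/LEN_H/6 corresponds to 1/54, or about 1.85%, of the n² search space, where n² is taken here as the product of the two structure sizes (as in Fig. 5).

```python
def sfp_list_budget(n1, n2, len_h=9, divisor=6):
    """Approximate SFP-list size for structures of n1 and n2 residues."""
    return (n1 * n2) / len_h / divisor


def sfp_list_fraction(len_h=9, divisor=6):
    """The same estimate expressed as a fraction of the n1 * n2 search space."""
    return 1.0 / (len_h * divisor)


# Hypothetical structure sizes of 150 and 200 residues.
budget = sfp_list_budget(150, 200)
```

For structures of 150 and 200 residues this gives a list of roughly 556 SFPs out of 30,000 possible start pairs.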

Finally, we conclude that, in structure comparison, considering only the TopK highest-scoring SFPs to build the initial alignment is feasible from the family level to the fold level, as long as the coding system is specific enough. The parameter self-adaptive strategy for generating SFPs is also effective and economical.

TABLE IV: TopK accuracy check with different strategies, from the TopK SFP check (left part) to the TopK AFP's neighbor check (right part); different coding systems, AA (A%), CL (C%) and hp-CL (H%); different homology levels, family (Fam), superfamily (Sup) and fold; and different SFP lengths L of 6, 12 and 18.

            | TopK SFP's accuracy                                                     | TopK AFP's neighbor accuracy
Level  TopK |      L = 6        |      L = 12       |      L = 18       | Self(9-18)* |      L = 6        |      L = 12       |      L = 18
            |    A%    C%    H% |    A%    C%    H% |    A%    C%    H% |    C%    H% |    A%    C%    H% |    A%    C%    H% |    A%    C%    H%
Fam       1 |  53.1  63.1  68.0 |  62.4  74.4  77.6 |  62.4  74.3  77.1 |  73.1  75.5 |  25.8  41.4  46.3 |  46.1  69.7  74.2 |  62.4  86.0  88.5
          5 |  74.4  86.7  88.9 |  79.4  90.7  92.3 |  80.1  91.2  92.4 |  90.1  91.6 |  35.1  53.3  59.0 |  55.8  79.3  83.4 |  71.0  91.7  93.7
         10 |  82.0  93.1  94.4 |  85.7  95.4  96.2 |  86.6  96.0  96.6 |  95.2  96.0 |  41.1  59.9  65.9 |  61.4  83.4  87.3 |  75.4  93.9  95.6
         20 |  89.1  97.1  97.8 |  91.8  98.2  98.6 |  92.8  98.4  98.8 |  98.0  98.5 |  45.7  64.4  70.5 |  65.2  85.9  89.6 |  78.3  95.0  96.6
Sup       1 |  39.1  48.7  52.4 |  44.8  58.3  61.8 |  43.8  58.2  61.4 |  56.2  59.4 |  22.5  36.3  40.6 |  39.7  63.9  68.0 |  56.6  83.3  85.8
          5 |  58.2  73.7  76.7 |  62.4  79.2  81.8 |  63.0  79.7  81.5 |  78.4  80.3 |  31.0  47.9  52.9 |        73.9  77.6 |  65.1  89.8  91.7
         10 |  67.0  83.1  85.3 |  71.1  87.3  88.9 |  72.7  87.9  89.4 |  86.9  88.5 |  36.7  54.7  59.8 |  54.2  78.5  82.2 |  69.5  92.3  93.9
         20 |  77.2  90.5  91.8 |  81.0  93.5  94.5 |  82.8  94.2    95 |  93.5  94.4 |  41.1  59.4  64.7 |  57.9  81.4  84.9 |  72.5  93.7  95.2
Fold      1 |  13.9  26.8  29.7 |  17.0  33.8  36.7 |  17.0  33.5  35.6 |  31.9  34.2 |  10.7  24.8  27.0 |  20.0  46.6  49.8 |  32.2  68.4  70.7
          5 |  28.8  50.3  54.2 |  33.2  57.0  60.2 |  34.2  58.6  60.7 |  57.4  59.6 |  17.3  35.1  38.6 |  28.3  57.8  61.4 |  41.5  77.4  79.4
         10 |  39.5  63.6  66.7 |  45.7  70.5  73.0 |  48.4  74.1  75.4 |  72.5  74.2 |  22.3  41.9  46.0 |  33.7  63.8  67.6 |  46.9  81.5  83.5
         20 |  55.8  77.2  79.4 |  62.3  84.5  86.3 |  66.3  87.0  87.8 |  85.9  87.0 |  26.4  46.9  51.4 |  37.8  67.8  71.6 |  51.1  84.3  86.2

*: Self(9-18) means the application of the self-adaptive strategy [26] with the SFP length L ranging from 9 to 18.




[FIG. 5 plot: histogram; horizontal axis 'percentage of the size of SFP-list'.]

FIG. 5: Histogram of the size of the SFP-list as a percentage of the search space (moln1 × moln2) over all homology levels (from family to fold) under the self-adaptive strategy, where moln1 (moln2) is the size of the first (second) structure. The peak value is about 1.045% and the mean value is about 1.5505%. The total pair count is 50,069.

C. Implementation of hydropathy conformational letters
in structural alignment

We embed hp-CL into pairwise protein structural alignment under the framework of CLeFAPS [26]. Specifically, we first transform each structure into its hp-CL string. We then search for two kinds of SFPs: highly specific SFPs (SFP_H), which have a high HP-CLESUM score and are used to build an initial alignment from the SFP with the best TM-score [39] within the TopK, and highly sensitive SFPs (SFP_L), which have a low HP-CLESUM score (but above 0) and are used to refine the alignment through the fuzzy-add strategy. These two SFP-lists can be generated simultaneously. Finally, we apply an elongation step based on the Vect-score to collect local flexible fragments.

HOMSTRAD is a database of protein structural alignments for homologous families [42]. Its alignments were generated using structural alignment programs, followed by manual scrutiny of individual cases. There are 1033 families in total (633 at the pairwise level). Table V shows the improvement obtained by using hp-CL instead of CL as the coding system under the same algorithm, CLeFAPS.

TABLE V: Alignment accuracy metric on HOMSTRAD

Metric     CLeFAPS(CL)   CLeFAPS(hp-CL)   MATT
C/LOA^1    0.929         0.939            0.948
C/LOR^2    0.898         0.907            0.831

1: Correct / (Length of the algorithm's alignment).
2: Correct / (Length of the reference).

IV. DISCUSSION AND FUTURE WORK



Entropic clustering is a simple but effective approach to exploring the joint space of AAs and SAs. In this work only the reduction of AAs is considered, but CLs and AAs could also be reduced simultaneously, balancing accuracy against the number of parameters. For example, if the CLs are reduced to 9 letters (from Fig. 1, there are 4 codes for helix that can be grouped into one cluster, and likewise for sheet), we may then consider up to four AA clusters instead of two, while the total alphabet size stays about the same as for hp-CL.
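
The trade-off in alphabet size can be made concrete with a few lines of arithmetic. The function below is a hypothetical helper that counts the entries of a joint substitution matrix for a given number of CLs and AA groups.

```python
def joint_matrix_entries(n_cl, n_aa_groups, symmetric=False):
    """Number of entries of a substitution matrix over (CL, AA group) joint states.

    With symmetric=True only the distinct entries of a symmetric matrix are counted.
    """
    k = n_cl * n_aa_groups
    return k * (k + 1) // 2 if symmetric else k * k


full = joint_matrix_entries(17, 20)        # 115600: 17 CLs with all 20 AAs
hp_cl = joint_matrix_entries(17, 2)        # 1156: the hp-CL alphabet
reduced = joint_matrix_entries(9, 4)       # 1296: 9 CLs with four AA groups
```

Nine CLs with four AA groups give 36 joint letters, close to the 34 of hp-CL, which is the sense in which the total alphabet size stays about the same.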

Interestingly, hp-CL can be applied even when we know only a little about the AA sequence of the structure, i.e., only its hydropathy features, or nothing at all. This is because, from the knowledge of protein design [10, 41], the hydropathy pattern can probably be deduced from a 3D structure. Using hp-CLs and HP-CLESUM, which take the hydropathy pattern into account, should then give a more accurate result than using CLs and CLESUM, which consider only the 3D structure.

We also verify a basic idea of CLeFAPS, the self-adaptive strategy for generating SFPs, which removes the need to tune parameters for different purposes and different proteins. The results show that its accuracy is well maintained and the SFP-list size is kept at about O(n²/LEN_H/6), whereas with fixed parameters it is hard to balance accuracy against size.

The TopK accuracy check has demonstrated that the basic strategy of both CLePAPS and CLeFAPS, which considers only the TopK highest-scoring SFPs to build the initial alignment, is efficient. Moreover, the TopK accuracy check is an effective way to assess coding systems against a reference dataset, and in particular to judge the substitution matrix. If a coding system is good enough, it should rank the highly specific SFPs near the top of the SFP-list. In future work, we will use this approach to test the currently available SAs on their performance in finding specific SFPs. We can also compare SAs with RMSD values or with p-values derived from RMSD. Such a comparison between 1D coding systems and a 3D measure will show the effectiveness of SAs, because they contain statistical information from the database [24].